This tutorial is based on resources from http://scikit-learn.org and the scikit-learn tutorial at Scipy 2013 by Gael Varoquaux, Jake VanderPlas and Olivier Grisel (https://www.youtube.com/watch?v=r4bRUvvlaBw and https://github.com/jakevdp/sklearn_scipy2013).
Machine learning is about creating models from data: for that reason, we’ll start by discussing how data can be represented in order to be understood by the computer. Along with this, we’ll build on our matplotlib examples from the previous lab and show some examples of how to visualize data.
Most machine learning algorithms implemented in scikit-learn expect data to be stored in a two-dimensional array of shape (n_samples, n_features). The arrays can be either NumPy arrays or, in some cases, scipy.sparse matrices. The number of features must be fixed in advance. However, it can be very high-dimensional (e.g. millions of features), with most of them being zeros for a given sample. This is a case where scipy.sparse matrices can be useful, in that they are much more memory-efficient than NumPy arrays.
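As a minimal sketch (the array shapes and values below are made up for illustration), a small dataset of 3 samples with 4 features each could be stored either densely or sparsely:
In [ ]:
import numpy as np
from scipy import sparse

# A dense data matrix of shape (n_samples, n_features): 3 samples, 4 features
X_dense = np.array([[0., 1.5, 0., 2.],
                    [0., 0., 0., 1.],
                    [3., 0., 0., 0.]])
print(X_dense.shape)  # (3, 4)

# The same data as a scipy.sparse matrix: only the non-zero entries are stored
X_sparse = sparse.csr_matrix(X_dense)
print(X_sparse.nnz)   # number of non-zero entries actually stored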
For this lab and many of the coming ones, we will work with gene expression data measured on endometrium and uterus tumors. The data originally comes from http://gemler.fzv.uni-mb.si/index.php but has been downsized so that it is easier to work with in our labs.
The data we will work with contains the expression of 3,000 genes, measured for 61 endometrium tumors and 123 uterus tumors.
The data is stored in a CSV file, small_Endometrium_Uterus.csv. Many utilities can be used to read CSV files, including the csv module and the pandas library (which is designed for data analysis). We will focus on using numpy directly.
A quick look at the file shows us that the first row is a header, the gene expression values are stored in columns 1 to 3,000, and the last column contains the tumor type (the label).
In [1]:
import numpy as np
In [2]:
# Load the gene expression data
X = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',',
skiprows=1, usecols=range(1, 3001))
In [3]:
# Load the labels (Python 2)
y = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',',
skiprows=1, usecols=[3001], dtype='string')
# Python 3 alternative:
#y = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',',
# skiprows=1, usecols=[3001], dtype='bytes').astype('str')
# Convert 'Endometrium' to 0 and 'Uterus' to 1
y = np.where(y=='Endometrium', 0, 1)
print y
In [ ]:
# Python 2.7 only: We could have done the previous operation in one step, using a converter
y = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',', skiprows=1, usecols=[3001],
converters={3001: lambda s: 0 if s=='Endometrium' else 1}, dtype='int')
print y
In [ ]:
# Load pylab
%pylab inline
In [ ]:
# Visualize the data across 2 of its axes
idx_1 = 0 # first axis (or gene)
idx_2 = 1 # second axis (or gene)
plt.scatter(X[y==0, idx_1], X[y==0, idx_2], # y==0 selects the rows (samples) for which y is equal to 0
color='blue', label='Endometrium')
plt.scatter(X[y==1, idx_1], X[y==1, idx_2], color='orange', label='Uterus')
plt.legend(scatterpoints=1)
plt.xlabel('Gene %d' % idx_1, fontsize=14)
plt.ylabel('Gene %d' % idx_2, fontsize=14)
Play with the values of idx_1 and idx_2. What do you notice about the range of the values taken by the different genes?
Our goal here is to classify samples as endometrium or uterus tumors. We will use a Gaussian Naive Bayes classifier.
Documentation: http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
In [ ]:
# import Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
In [ ]:
# create a Gaussian Naive Bayes classifier i.e. an instance of GaussianNB
gnb = GaussianNB()
In [ ]:
# fit the classifier to the data
gnb.fit(X, y)
In [ ]:
# predict on the same data
y_pred = gnb.predict(X)
In [ ]:
# compute the number of mislabeled points (samples)
print("Number of mislabeled points out of a total %d points : %d" % \
(X.shape[0], (y != y_pred).sum()))
In [ ]:
# Score the model
from sklearn import metrics
print "Accuracy: %.3f" % metrics.accuracy_score(y, y_pred)
In [ ]:
# Get probability estimates
y_prob = gnb.predict_proba(X)
print y_prob.shape
Question: y_prob contains two values for each data point because predict_proba returns one probability estimate per class. The order in which the classes appear is given by gnb.classes_. How do you get a one-dimensional array that only contains, for each point, the estimated probability of belonging to the positive class?
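One possible answer (a sketch; it assumes the positive class 1 is the second entry of gnb.classes_, which you should check):
In [ ]:
print(gnb.classes_)          # e.g. [0 1]: the column order of predict_proba
y_prob_pos = y_prob[:, 1]    # estimated probability of belonging to class 1, for each point
print(y_prob_pos.shape)      # one value per sample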
In [ ]:
In [ ]:
# ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y, y_prob[:, 1], pos_label=1)
# Area under the ROC curve
auc = metrics.auc(fpr, tpr)
# Plot the ROC curve
plt.plot(fpr, tpr, '-', color='orange', label='AUC = %0.3f' % auc)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Gaussian Naive Bayes', fontsize=16)
plt.legend(loc="lower right")
Question: What are the parameters of the model we have trained? How many of them are there? How can you access them?
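A sketch of one way to inspect them (attribute names as in the scikit-learn version used here; in recent versions sigma_ is called var_):
In [ ]:
# Class-conditional means and variances of each feature, plus the class priors
print(gnb.theta_.shape)    # (n_classes, n_features): one mean per class and per gene
print(gnb.sigma_.shape)    # (n_classes, n_features): one variance per class and per gene
print(gnb.class_prior_)    # prior probability of each class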
Question: Why is it problematic to have evaluated our classifier on the training data?
In [ ]:
# Set up a stratified 10-fold cross-validation
from sklearn import cross_validation
folds = cross_validation.StratifiedKFold(y, 10, shuffle=True)
In [ ]:
# This is one way to access the training and test points
for ix, (tr, te) in enumerate(folds):
    print "Fold %d" % ix
    print "\t %d training points" % len(tr)
    print "\t %d test points" % len(te)
Question: Create a loop that goes through all 10 folds. For each fold, train a Gaussian Naive Bayes classifier on the training points and compute probability estimates for the test points. Store all these estimates in a single array y_prob_cv (as each point from the data appears exactly once as a test point in the cross-validation).
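One possible sketch of such a loop (it assumes folds was created as above and reuses the GaussianNB class imported earlier):
In [ ]:
y_prob_cv = np.zeros(X.shape[0])  # will hold one probability estimate per sample
for tr, te in folds:
    clf = GaussianNB()
    clf.fit(X[tr, :], y[tr])                            # train on the training points only
    y_prob_cv[te] = clf.predict_proba(X[te, :])[:, 1]   # store estimates for the test points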
In [ ]:
Question: Evaluate the performance of your model (accuracy, ROC curve, area under the ROC curve). How does it compare to the performance you previously obtained on the training data?
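One possible way to compute these metrics from y_prob_cv (a sketch, thresholding the probabilities at 0.5 to get hard predictions):
In [ ]:
y_pred_cv = np.where(y_prob_cv > 0.5, 1, 0)
print("Cross-validated accuracy: %.3f" % metrics.accuracy_score(y, y_pred_cv))
print("Cross-validated AUC: %.3f" % metrics.roc_auc_score(y, y_prob_cv))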
In [ ]:
Question: The sklearn.cross_validation module provides some utilities to make cross-validated predictions. Compare the results you obtained to what they return.
In [ ]:
gnb = GaussianNB()
cv_aucs = cross_validation.cross_val_score(gnb, X, y, cv=folds, scoring='roc_auc')
print np.mean(cv_aucs)
# Note that averaging the AUCs obtained over 10 folds is not exactly the same as
# globally computing the AUC for the predictions made within the cross-validation loop.
In [ ]:
gnb = GaussianNB()
print "Cross-validated accuracy: %.3f" % \
metrics.accuracy_score(y, cross_validation.cross_val_predict(gnb, X, y, cv=folds))
# This should return the same as
print "Cross-validated accuracy: %.3f" % metrics.accuracy_score(y, np.where(y_prob_cv > 0.5, 1, 0))
As an example of a simple dataset, we’re going to take a look at the Iris data stored by scikit-learn. The data is a classical UCI dataset and consists of measurements of three different species of irises.
You can learn more about this data set here: https://en.wikipedia.org/wiki/Iris_flower_data_set.
Question: What features could you use to represent irises in order to decide to which of the three species they belong?
scikit-learn has a very straightforward set of data on these iris species. The data consist of the following:
- Features: sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm)
- Target: the iris species (setosa, versicolor or virginica)
- 150 samples in total, 50 from each species
In [ ]:
from sklearn.datasets import load_iris
iris = load_iris()
In [ ]:
# iris is a Bunch object
# see what’s available in iris:
iris.keys()
In [ ]:
# The features of each sample flower are stored in the data
# attribute of the dataset:
n_samples, n_features = iris.data.shape
print n_samples
print n_features
print iris.data[0]
In [ ]:
# The information about the class of each sample is stored in the target
# attribute of the dataset:
print iris.target.shape
print iris.target
In [ ]:
# The names of the classes are stored in the last attribute,
# namely target_names:
print iris.target_names
In [ ]:
# This data is four-dimensional, but we can visualize two of the dimensions at a time using a simple scatter plot.
%pylab inline
idx_1 = 0 # index of the first feature to plot
idx_2 = 1 # index of the second feature to plot
# Scatterplot
plt.scatter(iris.data[:, idx_1], iris.data[:, idx_2], s=40, c=iris.target)
# Create a formatter that will label the colorbar with the correct target names.
# lambda i, *args: iris.target_names[int(i)] converts an index i into the corresponding label
formatter = plt.FuncFormatter(lambda i, *args: iris.target_names[int(i)])
plt.colorbar(ticks=[0, 1, 2], format=formatter)
plt.xlabel(iris.feature_names[idx_1], fontsize=14)
plt.ylabel(iris.feature_names[idx_2], fontsize=14)
Question: Change idx_1 and idx_2, and find (visually) which combination of two features best separates the three classes.
Question: Create a Gaussian Naive Bayes classifier and train it on the Iris data, using only the two features you have identified in the previous question. How many instances are misclassified?
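One possible sketch (using, for illustration, petal length and petal width, i.e. feature indices 2 and 3; substitute the features you identified):
In [ ]:
idx_1, idx_2 = 2, 3                      # assumed feature indices, for illustration
X_iris = iris.data[:, [idx_1, idx_2]]    # keep only the two chosen features
gnb_iris = GaussianNB()
gnb_iris.fit(X_iris, iris.target)
y_pred_iris = gnb_iris.predict(X_iris)
print("Misclassified instances: %d" % (iris.target != y_pred_iris).sum())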
In [ ]:
In [ ]:
# Load the digits dataset
from sklearn.datasets import load_digits
digits = load_digits()
In [ ]:
# Each data point is an image; its target is the digit it represents
print np.unique(digits.target)
What is the difference between digits.data and digits.images? Actually, digits.data is simply digits.images reshaped: each 8×8 image is flattened into a 64-dimensional vector. These two representations can coexist without fully duplicating the data, hence the memory overhead is very small: the two arrays, although they have different shapes, point at the same memory block. To check that this is true:
In [ ]:
# check the two arrays contain identical data
print np.all(digits.images.reshape((1797, 64)) == digits.data)
# check the two arrays point to the same memory address
print digits.data.__array_interface__['data']
print digits.images.__array_interface__['data']
In [ ]:
# set up a figure that will contain multiple subplots
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
# plot the first 49 digits: each image is 8 by 8 pixels
for i in range(49):
    ax = fig.add_subplot(7, 7, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.Greys, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 6, str(digits.target[i]))
plt.show()
Question: What is the effect of the interpolation='nearest' parameter?
Question: Modify the above script to visualize all the “8”s contained in the data. Use the cmap option of imshow to display the data with a colormap that is more appropriate for visualizing intensities.
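One possible sketch (it selects the images whose target is 8 and uses the gray_r colormap as an example of an intensity-oriented colormap):
In [ ]:
eights = digits.images[digits.target == 8]   # all images labeled as an 8
fig = plt.figure(figsize=(6, 6))
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(min(49, eights.shape[0])):    # show at most 49 of them
    ax = fig.add_subplot(7, 7, i + 1, xticks=[], yticks=[])
    ax.imshow(eights[i], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()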
In [ ]:
Question: Create a Gaussian Naive Bayes classifier and train it on the Digits data. Look at some of the mislabeled digits. What was their actual class, and what digit were they mistaken for? Can you guess why?
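One possible sketch (it trains on the whole dataset, as above, and displays a few of the mislabeled images together with their true and predicted classes):
In [ ]:
gnb_digits = GaussianNB()
gnb_digits.fit(digits.data, digits.target)
pred_digits = gnb_digits.predict(digits.data)
mislabeled = np.where(digits.target != pred_digits)[0]   # indices of the mislabeled digits
print("%d mislabeled digits" % len(mislabeled))
fig = plt.figure(figsize=(6, 2))
for plot_idx, i in enumerate(mislabeled[:5]):            # look at the first 5 of them
    ax = fig.add_subplot(1, 5, plot_idx + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.gray_r, interpolation='nearest')
    ax.set_title("true %d / pred %d" % (digits.target[i], pred_digits[i]), fontsize=8)
plt.show()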
In [ ]: